
Conversation

ycchenzheng
Collaborator

Fixes / Features

  • Merge the --base-docker-image and --docker-image flags

Testing / Documentation

Tested with https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/recipes/pw_mcjax_benchmark_recipe.py for both mcjax and pathways
Changed https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/maxtext_xpk_runner.py#L624 to

    docker_image_flag = f'--docker-image="{wl_config.base_docker_image}"'

mcjax uses RUNNER="maxtext_base_image" and pathways uses RUNNER="gcr.io/tpu-prod-env-multipod/wstcliyu_latest:latest" as the runner image.
With mcjax, XPK builds and pushes the local maxtext_base_image to the remote registry for the pods to pull; with pathways, the pods pull the image directly from the remote registry.
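The local-vs-remote branching can be sketched with a small heuristic (a hypothetical helper for illustration only, not xpk's actual code): a bare image name like maxtext_base_image has no registry host, while a remote reference starts with a host component containing a dot or a port.

```python
def is_remote_image(image: str) -> bool:
    """Treat image names whose first path component contains a dot or a
    colon (a registry host such as gcr.io or host:5000) as remote;
    bare names like maxtext_base_image are local builds."""
    first_component = image.split("/", 1)[0]
    return "." in first_component or ":" in first_component

# mcjax runner: local image, so XPK builds and pushes it first
print(is_remote_image("maxtext_base_image"))  # False
# pathways runner: remote image, pulled directly by the pods
print(is_remote_image("gcr.io/tpu-prod-env-multipod/wstcliyu_latest:latest"))  # True
```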
XPK log:

[XPK] Building /usr/local/google/home/chzheng/maxtext into docker image.
[XPK] Task: `Building script_dir into docker image` is implemented by `docker buildx build --platform=linux/amd64 -f /tmp/tmpvl105_wh -t chzheng-runner /usr/local/google/home/chzheng/maxtext`, streaming output live.
[+] Building 3.8s (9/9) FINISHED                                                                                                                         docker:default
 => [internal] load build definition from tmpvl105_wh                                                                                                              0.0s
 => => transferring dockerfile: 212B                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/python:3.10                                                                                                     1.3s
 => [internal] load .dockerignore                                                                                                                                  0.0s
 => => transferring context: 45B                                                                                                                                   0.0s
 => [1/4] FROM docker.io/library/python:3.10@sha256:6ff000548a4fa34c1be02624836e75e212d4ead8227b4d4381c3ae998933a922                                               0.0s
 => [internal] load build context                                                                                                                                  0.0s
 => => transferring context: 39.83kB                                                                                                                               0.0s
 => CACHED [2/4] WORKDIR /app                                                                                                                                      0.0s
 => [3/4] COPY . .                                                                                                                                                 1.3s
 => [4/4] WORKDIR /app                                                                                                                                             0.0s
 => exporting to image                                                                                                                                             1.0s
 => => exporting layers                                                                                                                                            1.0s
 => => writing image sha256:8f7f59fdd22171fa0ac861a9e7559c4c58f80978950a7bb04eaf2ee37f004ffd                                                                       0.0s
 => => naming to docker.io/library/chzheng-runner                                                                                                                  0.0s
[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39 to tpu-prod-env-one-vm
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39`, streaming output live.
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39`, streaming output live.
The push refers to repository [gcr.io/tpu-prod-env-one-vm/chzheng-runner]
5f70bf18a086: Layer already exists 
917a4b2a5731: Pushed 
xitg-2025-08-08-17-56-39: digest: sha256:766a71cc50f9c8100c98ccba5d451dbbb82d15f6b275ee884f1b2cb278153f36 size: 2420
[XPK] Task: `Upload Docker Image` terminated with code `0`

Pod log:

Events:
  Type     Reason                           Age                From               Message
  ----     ------                           ----               ----               -------
  Normal   Scheduled                        32s                default-scheduler  Successfully assigned default/chzhe-pw-2-wtf-slice-job-1-0-t8nn6 to gke-tpu-e96bd525-tn7c
  Normal   Pulling                          32s                kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39"
  Normal   Pulled                           28s                kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39" in 4.365s (4.365s including waiting). Image size: 455215023 bytes.
  Normal   Created                          28s                kubelet            Created container: jax-tpu
  Normal   Started                          27s                kubelet            Started container jax-tpu
  Warning  FailedToRetrieveImagePullSecret  26s (x3 over 32s)  kubelet            Unable to retrieve some image pull secrets (None); attempting to pull the image may not succeed.

  • [x] Tests pass
  • [x] Appropriate changes to documentation are included in the PR

@ycchenzheng self-assigned this Aug 8, 2025
@ycchenzheng
Collaborator Author

@SujeethJinesh

@ycchenzheng force-pushed the chzheng/docker_image_flag branch from 3d26153 to 954377f on August 8, 2025 18:39
@ycchenzheng force-pushed the chzheng/docker_image_flag branch 2 times, most recently from 36b0cfc to 95e6fa0 on August 11, 2025 01:04
@ycchenzheng force-pushed the chzheng/docker_image_flag branch from 95e6fa0 to b70f1b8 on August 11, 2025 18:05
@SujeethJinesh
Collaborator

Hmm, I'm rethinking whether we should merge these flags. I think we should still support both flags, but when using the benchmark runner in MaxText, Pathways should be able to use either --base-docker-image or --docker-image.

@ycchenzheng
Collaborator Author

Hmm, I'm rethinking whether we should merge these flags. I think we should still support both flags, but when using the benchmark runner in MaxText, Pathways should be able to use either --base-docker-image or --docker-image.

https://github.com/AI-Hypercomputer/xpk/blob/chzheng/docker_image_flag/src/xpk/core/docker_image.py#L228 checks --docker-image first, then --base-docker-image, then DEFAULT_DOCKER_IMAGE.
This change remains backward compatible.

Collaborator

@SujeethJinesh left a comment


Thanks Zheng!

Collaborator

@SujeethJinesh left a comment


Thanks Zheng!

Once the commented code is removed, then it looks good to me.

@ycchenzheng force-pushed the chzheng/docker_image_flag branch from 6bf4461 to 08a7f36 on August 12, 2025 21:45
@ycchenzheng
Collaborator Author

Thanks Zheng!

Once the commented code is removed, then it looks good to me.

Done

@SujeethJinesh
Collaborator

@scaliby Would you be able to take a look at this PR?

Collaborator

@scaliby left a comment


Thanks for this change! Could you please also add the XPK execution command to the XPK log, so I can see what arguments XPK was executed with?

@ycchenzheng
Collaborator Author

Thanks for this change! Could you please also add the XPK execution command to the XPK log, so I can see what arguments XPK was executed with?
This was run through the MaxText benchmark recipes:

~/maxtext$ PYTHONPATH=. python3 benchmarks/recipes/pw_mcjax_benchmark_recipe.py

It calls https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/maxtext_xpk_runner.py, which in turn invokes xpk.py.
When using a local base docker image, the XPK command is:

python3 ~/xpk/xpk.py workload create  --cluster=pw-scale-test-v5e-32 --project=cloud-tpu-multipod-dev --zone=us-south1-a  --device-type=v5litepod-32  --num-slices=2 --command="export PROJECT=cloud-tpu-multipod-dev && export CLUSTER=pw-scale-test-v5e-32 && export ZONE=us-south1-a &&  echo LIBTPU_INIT_ARGS=' --xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_use_minor_sharding_for_major_trivial_input=true --xla_tpu_relayout_group_size_threshold_for_reduce_scatter=1 --xla_tpu_assign_all_reduce_scatter_layout=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion_fuse_all_reduce=false --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_all_reduce_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=true --xla_sc_enable_instruction_fusion=false --xla_sc_disjoint_spmem=false --xla_sc_disable_megacore_partitioning=true --2a886c8_chip_config_name=megachip_tccontrol --xla_tpu_enable_all_experimental_scheduler_features=true --xla_tpu_enable_scheduler_memory_pressure_tracking=true --xla_tpu_host_transfer_overlap_limit=24 --xla_tpu_aggressive_opt_barrier_removal=ENABLED --xla_lhs_prioritize_async_depth_over_stall=ENABLED --xla_tpu_enable_ag_backward_pipelining=true --xla_should_allow_loop_variant_parameter_in_chain=ENABLED --xla_should_add_loop_invariant_op_in_chain=ENABLED --xla_max_concurrent_host_send_recv=100 --xla_tpu_scheduler_percent_shared_memory_limit=100 --xla_latency_hiding_scheduler_rerun=2' && export LIBTPU_INIT_ARGS=' --xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_use_minor_sharding_for_major_trivial_input=true --xla_tpu_relayout_group_size_threshold_for_reduce_scatter=1 
--xla_tpu_assign_all_reduce_scatter_layout=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion_fuse_all_reduce=false --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_all_reduce_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=true --xla_sc_enable_instruction_fusion=false --xla_sc_disjoint_spmem=false --xla_sc_disable_megacore_partitioning=true --2a886c8_chip_config_name=megachip_tccontrol --xla_tpu_enable_all_experimental_scheduler_features=true --xla_tpu_enable_scheduler_memory_pressure_tracking=true --xla_tpu_host_transfer_overlap_limit=24 --xla_tpu_aggressive_opt_barrier_removal=ENABLED --xla_lhs_prioritize_async_depth_over_stall=ENABLED --xla_tpu_enable_ag_backward_pipelining=true --xla_should_allow_loop_variant_parameter_in_chain=ENABLED --xla_should_add_loop_invariant_op_in_chain=ENABLED --xla_max_concurrent_host_send_recv=100 --xla_tpu_scheduler_percent_shared_memory_limit=100 --xla_latency_hiding_scheduler_rerun=2' && export ENABLE_PATHWAYS_PERSISTENCE=1 && export JAX_PLATFORMS=tpu,cpu && export ENABLE_PJRT_COMPATIBILITY=true &&  python3 -m MaxText.train MaxText/configs/base.yml per_device_batch_size=2 ici_fsdp_parallelism=-1 remat_policy=custom decoder_layer_input=offload out_proj=offload query_proj=offload key_proj=offload value_proj=offload max_target_length=8192 attention=flash use_iota_embed=True dataset_path=gs://max-datasets-rogue dataset_type=synthetic enable_checkpointing=False sa_block_q=2048 sa_block_kv=2048 sa_block_kv_compute=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_q_dq=2048 sa_block_kv_dq=2048 
sa_use_fused_bwd_kernel=True profiler=xplane skip_first_n_steps_for_profiler=10 profiler_steps=5 use_vertex_tensorboard=True vertex_tensorboard_project=cloud-tpu-multipod-dev vertex_tensorboard_region=us-south1  steps=20 model_name=llama3.1-8b base_output_directory=gs://chzheng-us-south1/chzhengmcjax_2_slice_v5litepod-32_llama3_1-8b-8192-v5e-256/ use_vertex_tensorboard=false vertex_tensorboard_project="" vertex_tensorboard_region="" run_name=chzhe-mc-2-xml enable_rich_metrics=true " --docker-image="maxtext_base_image" --enable-debug-logs --workload=chzhe-mc-2-xml --priority=medium --max-restarts=0

XPK will build the image and push it to the remote registry:

[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28 to cloud-tpu-multipod-dev
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28`, streaming output live.
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28`, streaming output live.
The push refers to repository [gcr.io/cloud-tpu-multipod-dev/chzheng-runner]

And the pod event:

Events:
  Type     Reason                           Age                From               Message
  ----     ------                           ----               ----               -------
  Normal   Scheduled                        25s                default-scheduler  Successfully assigned default/chzhe-mc-2-xml-slice-job-0-0-4m5n9 to gke-tpu-8bb8a2ce-0w6w
  Normal   Pulling                          24s                kubelet            Pulling image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28"
  Normal   Pulled                           13s                kubelet            Successfully pulled image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28" in 11.473s (11.473s including waiting). Image size: 599034818 bytes.
  Normal   Created                          13s                kubelet            Created container: jax-tpu
  Normal   Started                          13s                kubelet            Started container jax-tpu
  Warning  FailedToRetrieveImagePullSecret  12s (x3 over 25s)  kubelet            Unable to retrieve some image pull secrets (None); attempting to pull the image may not succeed.

When using a remote docker image, the XPK command is:

python3 ~/xpk/xpk.py workload create-pathways   --server-image=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest   --proxy-server-image=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest    --termination-grace-period-seconds=300  --pathways-gcs-location=gs://chzheng-us-south1/chzhengpathways_2_slice_v5litepod-32_llama3_1-8b-8192-v5e-256/  --custom-pathways-server-args="--xla_tpu_use_enhanced_launch_barrier=true"  --custom-pathways-proxy-server-args="--xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_use_minor_sharding_for_major_trivial_input=true --xla_tpu_relayout_group_size_threshold_for_reduce_scatter=1 --xla_tpu_assign_all_reduce_scatter_layout=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion_fuse_all_reduce=false --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_all_reduce_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=true --xla_sc_enable_instruction_fusion=false --xla_sc_disjoint_spmem=false --xla_sc_disable_megacore_partitioning=true --xla_tpu_enable_all_experimental_scheduler_features=true --xla_tpu_enable_scheduler_memory_pressure_tracking=true --xla_tpu_host_transfer_overlap_limit=24 --xla_tpu_aggressive_opt_barrier_removal=ENABLED --xla_lhs_prioritize_async_depth_over_stall=ENABLED --xla_tpu_enable_ag_backward_pipelining=true --xla_should_allow_loop_variant_parameter_in_chain=ENABLED --xla_should_add_loop_invariant_op_in_chain=ENABLED --xla_max_concurrent_host_send_recv=100 --xla_tpu_scheduler_percent_shared_memory_limit=100 --xla_latency_hiding_scheduler_rerun=2 --xla_tpu_use_enhanced_launch_barrier=true"  
--custom-pathways-worker-args="--xla_tpu_use_enhanced_launch_barrier=true"   --cluster=pw-scale-test-v5e-32 --project=cloud-tpu-multipod-dev --zone=us-south1-a  --tpu-type=v5litepod-32  --num-slices=2 --command="export PROJECT=cloud-tpu-multipod-dev && export CLUSTER=pw-scale-test-v5e-32 && export ZONE=us-south1-a &&    export ENABLE_PATHWAYS_PERSISTENCE=1 && export JAX_PLATFORMS=proxy && export ENABLE_PJRT_COMPATIBILITY=true &&  python3 -m MaxText.train MaxText/configs/base.yml per_device_batch_size=2 ici_fsdp_parallelism=-1 remat_policy=custom decoder_layer_input=offload out_proj=offload query_proj=offload key_proj=offload value_proj=offload max_target_length=8192 attention=flash use_iota_embed=True dataset_path=gs://max-datasets-rogue dataset_type=synthetic enable_checkpointing=False sa_block_q=2048 sa_block_kv=2048 sa_block_kv_compute=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_q_dq=2048 sa_block_kv_dq=2048 sa_use_fused_bwd_kernel=True profiler=xplane skip_first_n_steps_for_profiler=10 profiler_steps=5 use_vertex_tensorboard=True vertex_tensorboard_project=cloud-tpu-multipod-dev vertex_tensorboard_region=us-south1 checkpoint_storage_use_ocdbt=False checkpoint_storage_use_zarr3=False enable_pathways_goodput=True enable_goodput_recording=True enable_single_controller=True metrics_file=metrics.txt goodput_upload_interval_seconds=30  steps=20 model_name=llama3.1-8b base_output_directory=gs://chzheng-us-south1/chzhengpathways_2_slice_v5litepod-32_llama3_1-8b-8192-v5e-256/  run_name=chzhe-pw-2-tan enable_rich_metrics=true " --docker-image=gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest --enable-debug-logs --workload=chzhe-pw-2-tan --priority=medium --max-restarts=0

The pod will pull the target image directly from the remote registry:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  90s   default-scheduler  Successfully assigned default/chzhe-pw-2-tan-pathways-head-0-0-nscgd to gke-pw-scale-test-v5e-32-cpu-np-4be25352-u4fd
  Normal  Pulling    90s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     90s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 389ms (389ms including waiting). Image size: 185513979 bytes.
  Normal  Created    90s   kubelet            Created container: pathways-rm
  Normal  Started    90s   kubelet            Started container pathways-rm
  Normal  Pulling    89s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     89s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 288ms (288ms including waiting). Image size: 180371181 bytes.
  Normal  Created    89s   kubelet            Created container: pathways-proxy
  Normal  Started    89s   kubelet            Started container pathways-proxy
  Normal  Pulling    88s   kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest"
  Normal  Pulled     88s   kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest" in 334ms (334ms including waiting). Image size: 1829862064 bytes.
  Normal  Created    88s   kubelet            Created container: jax-tpu
  Normal  Started    88s   kubelet            Started container jax-tpu

@ycchenzheng force-pushed the chzheng/docker_image_flag branch 2 times, most recently from a32daf1 to f27717a on August 27, 2025 20:12
@ycchenzheng force-pushed the chzheng/docker_image_flag branch from f27717a to 16e76b1 on August 27, 2025 20:13
@ycchenzheng requested a review from scaliby on August 27, 2025 20:15